Pharokka: a fast scalable bacteriophage annotation tool

Abstract Summary In recent years, there has been an increasing interest in bacteriophages, which has led to growing numbers of bacteriophage genomic sequences becoming available. Consequently, there is a need for a rapid and consistent genomic annotation tool dedicated for bacteriophages. Existing tools either are not designed specifically for bacteriophages or are web- and email-based and require significant manual curation, which makes their integration into bioinformatic pipelines challenging. Pharokka was created to provide a tool that annotates bacteriophage genomes easily, rapidly and consistently with standards compliant outputs. Moreover, Pharokka requires only two lines of code to install and use and takes under 5 min to run for an average 50-kb bacteriophage genome. Availability and implementation Pharokka is implemented in Python and is available as a bioconda package using ‘conda install -c bioconda pharokka’. The source code is available on GitHub (https://github.com/gbouras13/pharokka). Pharokka has been tested on Linux-64 and MacOSX machines and on Windows using a Linux Virtual Machine.


Introduction
As the number of bacteriophage (phage) sequences increases, there is a need for bioinformatic tools that enable fast, consistent and scalable genomic annotation (Cook et al., 2021). Existing tools such as RAST (Aziz et al., 2008;Davis et al., 2020), PHASTER (Arndt et al., 2016) and CPT Galaxy (Ramsey et al., 2020) are web-server based which may be laborious in particular when multiple phage genomes require annotation (Shen and Millard, 2021). On the other hand, currently available customizable bioinformatics pipelines such as multiPhATE2 require significant understanding of bioinformatics to implement and have lengthy run times (Ecale Zhou et al., 2021). Furthermore, command-line programs designed for viral discovery in metagenomic datasets such as Cenote-Taker 2 (Tisza et al., 2021), Hecatomb (Roach et al., 2022) and MetaPhage (Pandolfo et al., 2022) require significant computational resources and storage for database installation.
As a result, one-line prokaryotic genome annotation tools such as Prokka (Seemann, 2014) are often used for phages, especially where tens or hundreds of phages need to be annotated simultaneously (Beamud et al., 2022;Nordstrom et al., 2022). However, such tools implement prokaryotic gene prediction tools that are based on models that are not designed for phage genomes. In addition, phage genes are often lacking in their default databases, resulting in incomplete or hypothetical functional annotation of phage genes.
Inspired by Prokka, here we created Pharokka as a one-line tool tailored to phages that provides annotations in a fast, scalable and consistent fashion. Pharokka identifies predicted coding sequences (CDS), transfer RNAs (tRNAs), transfer-messenger RNAs (tmRNAs) and clustered regularly interspaced short palindromic repeats (CRISPRs), providing functional annotation for CDS using the PHROGs database (Terzian et al., 2021). With accessible bioconda installation, Pharokka is easily integrated into complex bioinformatic pipelines such as those created with Snakemake (Mö lder et al., 2021) or Nextflow (Di Tommaso et al., 2017).

Materials and methods
The Pharokka workflow is outlined in Figure 1.

Input
Pharokka requires assembled DNA sequences in FASTA format (Fig. 1A). For phage isolates, this usually consists of one complete contig, but Pharokka can also annotate incomplete assemblies or metavirome samples with multiple contigs in the multiFASTA 1 format. Furthermore, metagenomically assembled phage genomes and genomic contigs, derived using programs such as Virstorter2 (Guo et al., 2021), Hecatomb (Roach et al., 2022) and Cenote-taker 2 (Tisza et al., 2021), are also suitable to be annotated using Pharokka using meta mode.

Feature prediction
Pharokka uses PHANOTATE by default to predict CDS, as it is the only existing tool that is designed to predict genes in phage genomes Small phage genes are more prevalent than predicted based on existing non-phage specific annotation tools and may encode for anti-CRISPR and antimicrobial resistance proteins (Fremin et al., 2022). Alternatively, Pharokka users can specify Prodigal, a gene predictor designed for prokaryotic genes, be used for CDS prediction (Hyatt et al., 2010;Fig. 1B). Prodigal may be useful if users wish to annotate large metavirome datasets quickly using meta mode (Table 4). In addition, Pharokka employs tRNAscan-SE 2 (Chan et al., 2021) to predict tRNAs, Aragorn (Laslett and Canback, 2004) to predict tmRNAs and MinCED to predict CRISPRs (Bland et al., 2007;Fig. 1D). When meta mode is specified, Pharokka runs PHANOTATE on each contig separately, leading to considerable speed improvement when multiple threads are specified.

Functional gene annotation
Functional assignment of predicted CDS is conducted using the PHROGs database (Terzian et al., 2021;Fig. 1C). The PHROGs database contains 38 880 PHROGs (protein orthologous groups) containing 868 340 proteins from 17 473 complete genomes of viruses infecting bacteria or archaea that are grouped together using remote homology detection. Each PHROG is assigned to one of nine functional categories. Each gene within each PHROG has a specific product description if known. All predicted CDS are translated into a protein sequence and assigned to the closest matching protein and accompanying PHROG in the PHROGs database using the protein searching tool mmseqs2 (Steinegger and Sö ding, 2017). An e-value threshold of 10 À5 is used by default. Users can also specify their own e-value if desired. This is particularly useful if the phage is novel with few or no similar sequences in the PHROGs database, as a lower threshold can be used to detect less similar matches. If no PHROG match is found, then the CDS is annotated as 'No PHROG' and 'hypothetical protein'. All CDS that are annotated as 'terminase large subunit' are extracted into separate output files, as the terminase large subunit is commonly used for phylogenetic analysis (Al-Shayeb et al., 2020).

Virulence factor and antimicrobial resistance gene detection
While virulence factors are commonly found on prophages (Fortier and Sekulovic, 2013), lytic phages rarely encode antibiotic resistance genes (Enault et al., 2017;Cook et al., 2021). Nonetheless, screening is required for all phages that are intended to be used for phage therapy (Shen and Millard, 2021). Pharokka adds antimicrobial resistance and virulence gene detection using the Comprehensive Antibiotic Resistance Database (CARD) (Alcock et al., 2020)

Output
Pharokka's output files are outlined in Table 1. The primary output of Pharokka is a .gff file that is suitable for use in downstream pangenome programs such as Roary (Page et al., 2015) and panaroo (Tonkin-Hill et al., 2020; Fig. 1E). Other files include a .tbl file, which is a flat-file table suitable to be uploaded to the NCBI's Bankit, a cds_functions.tsv file, which includes counts of CDS, tRNAs, CRISPRs and tmRNAs and CDS within each functional category for each contig in the input FASTA file, a length_gc_cds_density.tsv file, which outputs the length, GC percentage and CDS coding density for each contig, a cds_final_merged_output.tsv, which gives the combined parsed output from mmseqs2 searches against the PHROGs, VFDB and CARD databases, and terL.faa and terL.ffn output files that contain the amino acid and nucleotide sequences of any predicted terminase large subunit genes. When run on metavirome input, Pharokka's contig-level summary output allows users to identify specific contigs within the metavirome that possess unusual features such as virulence factors, antimicrobial resistance genes or potential stop codon reassignment as indicated by low CDS coding density (Peters et al., 2022).

Results
To test the performance of Pharokka, we compared the run-time and annotations of Enterobacteria phage lambda (Genbank accession J02459), Staphylococcus phage SAOMS1 (Genbank accession MW460250) and 673 crAss-like metagenome-assembled phage genomes from the human gut (Yutin et al., 2021) with default Pharokka v1.1.0 using PHANOTATE as a gene predictor, Pharokka v1.1.0 specifying Prodigal as gene predictor and Prokka v1.14.6 using a version of the PHROGs HMM database that has been reformatted for use with Prokka (Millard, 2021 https://millardlab.org/ Benchmarking was conducted on an Intel V R Xeon V R CPU E5-4610 v2 @ 2.30 GHz specifying 16 threads for Pharokka and 16 cpus for Prokka. Tables 2 and 3 show that for phages Lambda and SAOMS1, Pharokka is slower than Prokka but still fast, finishing within 5 min regardless of gene predictor used, with PHANOTATE predicting more CDS with higher coding density than Prodigal. Table 4 shows that for the crAss-like phage genomes, Pharokka is considerably faster than using Prokka regardless of gene predictor, due to Pharokka using mmseqs2 for database searching rather than HMMER3 (Finn et al., 2011) employed by Prokka (Mirdita et al., 2019). As the size of the input increases, the extra-time cost incurred by Pharokka is due to the gene prediction and tRNA calling, rather than the database searching. For extremely large metavirome datasets, Pharokka in Prodigal meta mode is therefore recommended.
In addition, for the crAss-like phage genomes, the low coding density output of some contigs identified by Pharokka indicates that stop codon reassignment may be occurring in these contigs (Peters et al., 2022;Yutin et al., 2021).